[codex] disable Nagle on Rendezvous WebSockets #30269
CI diagnosis: the broad Bazel, clippy, release, and argument-comment jobs are failing on the PR base, before reaching this two-line change. The shared error is
connect_async_with_config(
    request,
    Some(noise_relay_websocket_config()),
    /*disable_nagle*/ false,
put a pr comment explaining this
Rendezvous carries small, latency-sensitive relay and JSON-RPC frames. With Nagle enabled, a small write can wait for an ACK or for more data to coalesce, adding delay directly to request/response turns. In three staging runs of 30 steady-state process/read calls per configuration, setting TCP_NODELAY improved p50 from 139.1 ms to 81.5 ms and p95 from 162.0 ms to 95.8 ms.
We are accepting the modest increase in small packets because the current connection volume is low. Existing latency, error, packet, and CPU monitoring will catch regressions, and rollback is a normal code revert. The companion server change applies the same setting to accepted Rendezvous sockets: https://github.com/openai/openai/pull/1082463
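For readers unfamiliar with the socket option, here is a minimal sketch of what disabling Nagle means on each side of a connection. It uses only the Rust standard library over loopback and is not the exec-server or Rendezvous code; the helper names `connect_nodelay`, `accept_nodelay`, and `nodelay_demo` are hypothetical and exist only for illustration.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener, TcpStream};

/// Hypothetical helper: connect to `addr` and set TCP_NODELAY on the
/// client socket so small writes are sent without waiting to coalesce.
fn connect_nodelay(addr: SocketAddr) -> io::Result<TcpStream> {
    let stream = TcpStream::connect(addr)?;
    stream.set_nodelay(true)?;
    Ok(stream)
}

/// Hypothetical helper: accept one connection and set TCP_NODELAY on
/// the accepted socket. The option is set per socket, so each side
/// controls only its own outgoing writes.
fn accept_nodelay(listener: &TcpListener) -> io::Result<TcpStream> {
    let (stream, _peer) = listener.accept()?;
    stream.set_nodelay(true)?;
    Ok(stream)
}

/// Loopback demonstration: returns the TCP_NODELAY flag as read back
/// from the client socket and from the accepted socket.
fn nodelay_demo() -> io::Result<(bool, bool)> {
    // Port 0 asks the OS for any free port.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let client = connect_nodelay(listener.local_addr()?)?;
    let server = accept_nodelay(&listener)?;
    Ok((client.nodelay()?, server.nodelay()?))
}
```

Because the option applies to each socket separately, the client-side change and the accepted-socket change each affect only writes leaving their own end of the connection.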
5a1272c to 41213bb
owenlin0 left a comment
🤷 trust that this is the right thing to do?
…dezvous-tcp-nodelay
Summary
Disable Nagle unconditionally for both exec-server Rendezvous WebSocket connections.
- Set `disable_nagle=true` at the executor and harness connection call sites
- The companion internal PR enables `TCP_NODELAY` on accepted Rendezvous sockets: https://github.com/openai/openai/pull/1082463
Why
Rendezvous carries small, latency-sensitive relay and JSON-RPC frames. Three staging runs of 30 steady-state `process/read` calls per configuration measured p50 improving from 139.1 ms to 81.5 ms and p95 from 162.0 ms to 95.8 ms with Nagle disabled.
The expected packet overhead is small at the current connection scale. We will use existing latency, error, packet, and CPU monitoring and revert normally if production regresses.
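As a rough reading aid, the staging figures above can be turned into relative reductions. This is a small sketch of that arithmetic only; the function name `reduction_percent` is hypothetical, and the inputs are the p50 and p95 values quoted in this description.

```rust
/// Hypothetical helper: relative reduction, in percent, when a latency
/// drops from `before_ms` to `after_ms`.
fn reduction_percent(before_ms: f64, after_ms: f64) -> f64 {
    (before_ms - after_ms) / before_ms * 100.0
}

/// Formats the reductions for the p50 and p95 figures quoted above.
fn staging_summary() -> String {
    let p50 = reduction_percent(139.1, 81.5); // about 41.4 %
    let p95 = reduction_percent(162.0, 95.8); // about 40.9 %
    format!("p50 -{:.1}%, p95 -{:.1}%", p50, p95)
}
```

Both percentiles fall by roughly 41 percent, so the improvement looks similar at the median and in the tail for this sample.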
Rollout and rollback
The client and accepted-socket changes can deploy independently. New connections receive the setting as each side deploys. Rollback is a normal code revert; there is no persisted assignment or gate state to unwind.
Validation
- `just test -p codex-exec-server --lib`: 164 passed
- `just fix -p codex-exec-server`: passed
- `just fmt`: passed